install.packages(c(
'dbplyr', # imports a bunch of stuff
'fastLink', # imports stringdist
'igraph',
'duckdb',
'fuzzylink',
'zoomerjoin'
))
Eric Manning, PhD
February 27, 2026
Recommended R packages:
Python:
Problem 1: The task is \(O(N_AN_B)\)
For merges, \(N_A = N_B = 10{,}000\) generates 100M comparisons across each of \(k\) linkage fields.
Deduplication of a single 10,000-record file: ~50M comparisons.
Solution: Some observations are almost certainly not matches, so we don’t need to compare them.
Problem 2: What is a “match?”
| id | first | mi | last | housenum | street | city | zip | st |
|---|---|---|---|---|---|---|---|---|
| 1 | eric | m | manning | 169 | nassau | princeton | 08540 | nj |
| 2 | ericm | | manning | 169 | nassau | | 08540 | |
| 3 | e | | manning | 169 | nassau | | 08540 | |
| 4 | e | | manning | 160 | nassau | | 08540 | |
| 5 | emily | | manning | 160 | nassau | | 08540 | |
Solution: Know your data well. Examine large clusters, esp. those with partial matches.
1. Pre-process input columns
2. Define blocking rules
3. Define comparison criteria
3a. Train an unsupervised probabilistic model?
4. Match by applying comparison criteria to blocked data
5. Resolve matches into underlying clusters (people, entities, etc.)
6. Generate canonical output
Some approaches blend these steps together (say, redefine comparison criteria iteratively based on match quality).
Basic deterministic linkage by first + last name (comparisons) within ZIP codes + gender (blocks)…
-- cols in dfA, dfB: id, first, last, zip, gender (M/F/NULL)
-- return just the indices of the matched pairs
SELECT l.id AS id_l, r.id AS id_r
FROM dfA AS l
INNER JOIN dfB AS r
-- blocking rules:
ON l.zip = r.zip AND COALESCE(l.gender = r.gender, true)
-- comparison filters:
WHERE
jaro_winkler_similarity(l.first, r.first) >= 0.9
AND jaro_winkler_similarity(l.last, r.last) >= 0.9;
Typical considerations:
"unknown", "NA", "NULL" "")stringr::stri_trans_general in R.The goal (usually):
| name |
|---|
| ERIC MANNING |
| Eric M Manning |
| MANNING, ERIC M MR. |
| eric m. manning jr |
| name_first | name_middle | name_last | name_suffix |
|---|---|---|---|
| eric | | manning | |
| eric | m | manning | |
| eric | m | manning | |
| eric | m | manning | jr |
install_github('mjfii/Name-Parser')
install_github('Nonprofit-Open-Data-Collective/peopleparser')
humaniformat
This is generally quite hard if the addresses are not already pre-processed.
Addresses as strings… in two stages:
stringr::str_pad(): str_pad('8540', width = 5, side = 'left', pad = '0') == '08540'
Addresses are very weird. Please do not attempt to standardize them on your own unless you are willing to dedicate a lot of time to building your own tool.
libpostal wrappers in R, Python, etc.
Rpostal: R bindings to the libpostal C library
usps-scourgify: Lighter Python library for standardization
addressr: Eviction Lab’s address parser (for R)
Or… use any (other) geocoder.
key | value
------------+-----------------
box |
city | BOSTON
name | DEVONSHIRE
qual |
unit | # PENTHOUSE 301
extra |
state | MA
predir |
sufdir |
country | USA
pretype |
suftype | PL
building |
postcode | 02109
house_num | 1
ruralroute |
Geocoding usually comes with standardization:
Please don’t ever pay for an address parser or geocoder (US only).
Geocoding converts messy address strings into structured lat/long coordinates. This means:
tidygeocoder
Use the Census API for bulk geocoding (10k batches):
## # A tibble: 1 x 3
## a lat long
## <chr> <dbl> <dbl>
## 1 169 Nassau Street, Princeton, NJ, 08540 40.4 -74.7
You can parallelize batches and the Census API will, generally, cooperate.
# pbmclapply() from pbmcapply; geocode() from tidygeocoder
geo_out = pbmclapply(
geo_input,
FUN = \(.x) {
geocode(
.x,
street = geo_in, city = city, state = state, postalcode = zip,
method = "census", full_results = TRUE,
api_options = list(census_return_type = 'geographies'),
custom_query = list(vintage = 'Census2010_Current')
)[, keep_vars]
},
mc.cores = parallel::detectCores() - 1, mc.preschedule = F
)
tidygeocoder features
Cascading (useful when one service doesn’t cover all your addresses):
## # A tibble: 2 x 4
## a lat long query
## <chr> <dbl> <dbl> <chr>
## 1 169 Nassau Street, Princeton, NJ, 08540 40.4 -74.7 census
## 2 Anfield, Liverpool, UK 53.4 -2.96 osm
ArcGIS will support ~1M geocodes/hour (with address standardization)
OpenStreetMap (OSM) coverage varies by country with rolling updates
IRB/other regulatory limitations to uploading address data
Local solution desirable for data ingestion pipelines (maybe)
A blocking field is (usually) just a join key (i.e., exact match required).
Considerations:
KNOW YOUR DATA WELL ENOUGH TO SELECT GOOD BLOCKS.
There were 3.4M SSA births in 1980. What if we compared everyone to each other?
11.7 trillion comparisons.
What if we block on gender?
5.9 trillion comparisons.
What if we block on gender and first initial?
513 billion comparisons.
What if we block on gender and first name?
47 billion comparisons (~12min w/ DuckDB’s Jaro-Winkler)
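The arithmetic behind these counts is just a sum over blocks of \(n_A \times n_B\). A minimal sketch, assuming ~3.4M records on each side and an (illustrative) even gender split:

```python
# Comparison counts under blocking: sum over shared block keys of n_A * n_B.
def total_comparisons(blocks_a, blocks_b):
    """blocks_a / blocks_b map a block key to that block's record count."""
    return sum(n * blocks_b.get(key, 0) for key, n in blocks_a.items())

n = 3_400_000  # ~SSA births in 1980

# no blocking: one giant block
no_blocking = total_comparisons({"all": n}, {"all": n})

# block on gender (assuming an even split): roughly halves the work
by_gender = total_comparisons(
    {"M": n // 2, "F": n // 2},
    {"M": n // 2, "F": n // 2},
)
```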
What’s wrong with this?
Get creative!
"New York City" vs. "NYC")Madison between 1980 and 2000(l.first IN r.first_nicknames) OR (r.first IN l.first_nicknames)Be sure you handle missingness as intended.
Sometimes we don’t have clean join keys anywhere.
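One classic fallback is a phonetic encoding such as Soundex. A minimal sketch of American Soundex (first letter plus three digits from consonant classes; not a library implementation):

```python
# American Soundex: keep the first letter, encode later consonants as digits,
# skip vowels, treat H/W as "transparent", collapse repeated codes, pad to 4.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(name: str) -> str:
    name = name.upper()
    out = [name[0]]
    prev = CODES.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":            # transparent: previous code carries over
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code
    return ("".join(out) + "000")[:4]
```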
{JOHN, JHON, KOHN} maps to {J500, J500, K500}
What about geospatial data?
H3 (Uber) tiles the globe into nested hexagons at 16 resolutions.
Each lat/long = one H3 cell index (a 64-bit integer).
Resolution controls block size (res 6 ≈ 36 km², res 8 ≈ 0.74 km²)
Key idea for blocking: expand each cell to a k-ring (cell + all neighbors within \(k\) steps) and compare only within overlaps.
There are APIs for basically every language.
INSTALL h3 FROM community; LOAD h3;
CREATE TABLE records AS FROM (VALUES
(1, 40.7128, -74.0060),
(2, 34.0522, -118.2437),
(3, 40.7282, -74.0776)
) t(id, lat, lng);
ALTER TABLE records ADD COLUMN h3 UBIGINT;
UPDATE records SET h3 = h3_latlng_to_cell(lat, lng, 3); -- ~68km len
SELECT a.id AS id_a, b.id AS id_b
FROM records a JOIN records b
ON h3_grid_distance(a.h3, b.h3) <= 1 -- overlapping 1-neighbor k-rings
AND a.id < b.id;
Levenshtein: Number of insertions, deletions, substitutions
Damerau-Levenshtein: same, but with transpositions
Jaccard (bag of n-grams) similarity: \[\frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}\] (the Jaccard distance is one minus this)
Jaro: \(\in [0,1]\), function of matching characters and their positions, transpositions
Jaro-Winkler: Upweights matching at beginning of string - Most common for names (developed for them)
Phonetics (soundex, [double-]metaphone) … and many others
More recently, token sort ratio functions (e.g., via RapidFuzz)
| Function | (AMELIA, MALIA) | (BETH, LISBETH) | (JOHN, JHON) |
|---|---|---|---|
| Levenshtein | 0.66 | 0.57 | 0.50 |
| Damerau-Levenshtein | 0.66 | 0.57 | 0.75 |
| Jaccard | 0.80 | 0.57 | 1.00 |
| Jaro-Winkler | 0.88 | 0 | 0.93 |
| Soundex (0 or 1) | 0 | 0 | 1.00 |
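The normalized Levenshtein row above is \(1 - d/\max(|a|, |b|)\). A short dynamic-programming sketch reproduces those values:

```python
# Levenshtein edit distance via the standard two-row DP, plus the
# normalized similarity 1 - dist / max(len).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = cur
    return prev[-1]

def lev_sim(a: str, b: str) -> float:
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```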
(2005-03-17, 1005-03-17) OR (2005-03-17, 2023-02-12)
1900-01-01 (missing!) or maybe even 1970-01-01 (UNIX epoch)
How to handle addresses?
Deterministic output: \(\mathbb{P}(M_{ij}) \in \{0,1\}\)
This assumes you know the criteria that defines a match (well enough).
But what if you don’t? (Or what if you want a posterior probability?)
Basic probabilistic model types:
Supervised classifiers (xgboost, probably)
What’s wrong with the following?
# cols in dfA, dfB: id, first, last, zip, gender
# dplyr will join data.frames, tibbles, etc. on NULLs
inner_join(dfA, dfB, by = c('zip', 'gender'), suffix = c('_l','_r')) |>
filter(
stringdist::stringdist(first_l, first_r, 'jw', p = 0.1) >= 0.9,
stringdist::stringdist(last_l, last_r, 'jw', p = 0.1) >= 0.9
) |>
select(id_l, id_r)
# then join the NULL gender obs in dfA to all of dfB
# and vice versa
CREATE TABLE match_inds AS
SELECT
-- return just the indices of the matched pairs:
l.id AS id_l, r.id AS id_r
FROM
dfA AS l
INNER JOIN
dfB AS r
-- blocking rules:
ON l.zip = r.zip AND coalesce(l.gender = r.gender, true)
-- comparison filters:
WHERE
jaro_winkler_similarity(l.first, r.first) >= 0.9
AND jaro_winkler_similarity(l.last, r.last) >= 0.9
;
This query is faster and consumes less memory than the dplyr version, and DuckDB will spill to disk if necessary.
con = dbConnect(duckdb('path_to.db', ...))
tbl(con, 'dfA') |>
inner_join(tbl(con, 'dfB'), by = c('zip', 'gender'), suffix = c('_l', '_r')) |>
filter(
jaro_winkler_similarity(first_l, first_r) >= 0.9,
jaro_winkler_similarity(last_l, last_r) >= 0.9
) |>
select(id_l, id_r) |>
compute(
name = 'match_inds',
temporary = FALSE
)
Remember! What’s different here vs. base R?
Why does jaro_winkler_similarity(...) work in the call to filter()?
We have a bunch of matched pairs (with posteriors). Match status might not be transitive. Now what?
Most commonly, connected components
Can implement other algorithms (with probabilities as IPWs) if computationally feasible (e.g., label propagation).
onager extension for DuckDB here.
In R, generate a graph object from the match-pairs list and use igraph::components(graph_obj).
In DuckDB:
(What’s the answer?)
See here for manual coding example in Python/DuckDB
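As a sketch of the manual approach, connected components over matched index pairs can be computed with union-find (names here are illustrative, not from the linked example):

```python
# Resolve matched index pairs into entity clusters: union-find computes
# the connected components of the match graph.
def components(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)           # union the two roots

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
```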
Household overclustering (!!)
Overclustering is particularly important for descriptive relationships…
In some cases, you may need the “best record” for each cluster.
Some options for inference:
Fellegi and Sunter (1969, JASA) | Enamorado, Fifield, and Imai (2019, APSR)
Do you need this? What is your end goal?
Benefit (0): You don’t have to define all of the possible “match” criteria combinations across linkage fields or label any training data.
Benefit (1): We formalize the assumptions necessary to incorporate posterior match probabilities as weights when the merged data provide an outcome or explanatory variable.
Benefit (2): We can incorporate prior match probabilities from existing information (e.g., migration rates, name frequency weighting).
You must specify blocks, comparison criteria/levels, posterior threshold for a match (sometimes), and (optional) priors.
Several robust, efficient implementations across languages.
Less math-heavy tutorial from Linacre (here).
Suppose we are matching datasets \(A\) and \(B\) across \(K\) fields.
Similarity between \(i \in A, j \in B\) in field \(k\) is \(S_k(i,j)\), typically \(\in [0,1]\).
We coarsen this similarity to an agreement pattern \(\gamma_k(i,j)\) where: \[\gamma_k(i,j) = \begin{cases} 2 & S_k(i,j) \geq \tau_2 \\ 1 & \tau_1 \leq S_k(i,j) < \tau_2 \\ 0 & \text{otherwise} \end{cases}\]
This is generalizable to distance metrics, discrete metrics (e.g., Levenshtein), etc.
For \(i, j\), we have \(\boldsymbol{\gamma}(i,j) = \{\gamma_1(i,j), ..., \gamma_K(i,j)\}\), where field \(k\) has some number of agreement levels \(L_k\).
Missingness is recorded as \(\delta_k(i,j)\) where: \[\delta_k(i,j) = \begin{cases} 1 & \text{if either } i_k \text{ or } j_k \text{ missing} \\ 0 & \text{otherwise} \end{cases}\]
Let \(m\in \{0,1\}\) be the (unknown) match status for \(i,j\), which is distributed
\[ M(i,j) \stackrel{iid}{\sim} \text{Bernoulli}(\lambda) \]
Assume that the probability of each agreement level is distributed
\[\gamma_k(i,j) | M(i,j) = m \stackrel{indep}{\sim} \text{Discrete}(\boldsymbol{\pi}_{km})\]
The values \(\boldsymbol{\gamma}, \boldsymbol{\delta}\) are given by the observed data. The observed-data likelihood, then, is: \(l_{\text{obsv}}(\lambda, \boldsymbol{\pi} | \boldsymbol{\delta}, \boldsymbol{\gamma}) \propto\)
\[\prod_{i=1}^{N_A} \prod_{j=1}^{N_B} \left\{ \sum_{m=0}^{1} \lambda^m (1-\lambda)^{1-m} \prod_{k = 1}^{K} \left( \prod_{l=1}^{L_k-1} \pi_{kml}^{\boldsymbol{1}\{\gamma_k(i,j) = l\}}\right)^{1-\delta_{k}(i,j)}\right\}\]
The probability of \((i,j)\) being a match is:
\[\begin{equation*} \begin{split} \xi_{i,j} & = \mathbb{P}(M_{ij} = 1 | \boldsymbol{\delta}(i,j),\boldsymbol{\gamma}(i,j)) \\ &= \frac{\lambda \prod_{k = 1}^{K} \left(\prod_{l=1}^{L_k-1} \pi_{k1l}^{\boldsymbol{1}\{\gamma_k(i,j) = l\}} \right)^{1-\delta_{k}(i,j)}}{\sum_{m=0}^{1}\left\{\lambda^m(1-\lambda)^{1-m} \prod_{k = 1}^{K} \left(\prod_{l=1}^{L_k-1} \pi_{kml}^{\boldsymbol{1}\{\gamma_k(i,j) = l\}} \right)^{1-\delta_{k}(i,j)}\right\}} \end{split} \end{equation*}\]
We need to compute the \(\lambda\) and the \(\pi_{kml}\) parameters, given by:
\[\begin{equation*} \begin{split} \lambda &= \frac{1}{N_A N_B}\sum_{i =1}^{N_A}\sum_{j=1}^{N_B} \xi_{ij} \\ \pi_{kml} &= \frac{\sum_{i=1}^{N_A}\sum_{j=1}^{N_B}\boldsymbol{1}\{\gamma_k(i,j) = l\}(1-\delta_k(i,j))\xi_{ij}^m(1-\xi_{ij})^{1-m}}{\sum_{i=1}^{N_A}\sum_{j=1}^{N_B}(1-\delta_k(i,j))\xi_{ij}^m(1-\xi_{ij})^{1-m}} \end{split} \end{equation*}\]
Iterate until convergence within tolerance.
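To make the E- and M-steps concrete, here is a toy EM sketch on simulated binary agreement patterns (\(L_k = 2\), no missingness; the parameter values and pair count are invented for illustration):

```python
# Toy Fellegi-Sunter EM with K binary agreement fields, no missingness.
import random

random.seed(1)
K = 3
LAM_TRUE, M_TRUE, U_TRUE = 0.1, 0.95, 0.2

# simulate agreement patterns gamma for 5,000 candidate pairs
pairs = []
for _ in range(5000):
    p = M_TRUE if random.random() < LAM_TRUE else U_TRUE
    pairs.append(tuple(int(random.random() < p) for _ in range(K)))

def posterior(g, lam, pi1, pi0):
    """E-step quantity: xi = P(match | gamma)."""
    num, den0 = lam, 1 - lam
    for k in range(K):
        num *= pi1[k] if g[k] else 1 - pi1[k]
        den0 *= pi0[k] if g[k] else 1 - pi0[k]
    return num / (num + den0)

# initialize with pi1 > pi0 to anchor the "match" class
lam, pi1, pi0 = 0.5, [0.9] * K, [0.1] * K
for _ in range(50):
    xis = [posterior(g, lam, pi1, pi0) for g in pairs]   # E-step
    lam = sum(xis) / len(xis)                            # M-step: lambda
    s1, s0 = sum(xis), sum(1 - x for x in xis)
    for k in range(K):                                   # M-step: pi_kml
        pi1[k] = sum(x for g, x in zip(pairs, xis) if g[k]) / s1
        pi0[k] = sum(1 - x for g, x in zip(pairs, xis) if g[k]) / s0
```

Pairs agreeing on all fields end up with high posteriors; pairs agreeing on none end up near zero.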
False positive rate (1 - precision), for some posterior match threshold \(S\):
\[\mathbb{P}(M_{ij} = 0 | \xi_{ij} \geq S) = \frac{\sum_{i=1}^{N_A}\sum_{j=1}^{N_B}\boldsymbol{1}\{\xi_{ij} \geq S \}(1-\xi_{ij})}{\sum_{i=1}^{N_A}\sum_{j=1}^{N_B}\boldsymbol{1}\{\xi_{ij} \geq S \}}\]
False negative rate (1 - recall):
\[\mathbb{P}(M_{ij} = 1 | \xi_{ij} < S) = \frac{\sum_{i=1}^{N_A}\sum_{j=1}^{N_B}\boldsymbol{1}\{\xi_{ij} < S \}\xi_{ij}}{\lambda N_A N_B}\]
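These model-based error rates are simple weighted sums over the posteriors. A hedged sketch with made-up toy inputs (unmatched pairs weighted by their posteriors, in the spirit of Enamorado et al.):

```python
# Model-based error estimates from posterior match probabilities xis,
# at posterior threshold S, over a full n_a x n_b comparison space.
def error_rates(xis, S, lam, n_a, n_b):
    above = [x for x in xis if x >= S]
    below = [x for x in xis if x < S]
    fpr = sum(1 - x for x in above) / len(above)     # 1 - precision
    fnr = sum(below) / (lam * n_a * n_b)             # 1 - recall
    return fpr, fnr

# toy example: three pairs from a 3 x 1 comparison space
xis = [0.9, 0.8, 0.1]
fpr, fnr = error_rates(xis, S=0.5, lam=sum(xis) / 3, n_a=3, n_b=1)
```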
F-S proposed three “regions” of the observed distribution of the posterior probabilities:
Most people just apply a threshold (the default is 0.85) and move on.
Recommendation: investigate some pairs in each observed \(\boldsymbol{\gamma}\) combination near some \(S_1\) and apply post-hoc filtering. (Recall household overclustering problem.)
fastLink
vars = c('first', 'middle', 'last', 'house_num', 'street', 'city', 'birth_year')
matches.out <- fastLink(
dfA, dfB,
varnames = vars,
stringdist.match = vars[c(1:3, 5:6)], # use stringdist on these
stringdist.method = 'jw', # or dl, jaro, lv
numeric.match = 'birth_year', # applies abs diff on these
# exact T/F/NA on all variables
partial.match = vars[c(1, 3, 5)], # three agreement levels for partial matches
cut.a = 0.94, cut.p = 0.88, # notable default!
cut.a.num = 1, cut.p.num = 2.5, # is this wise?
threshold.match = 0.85,
... # priors, t-f weighting (one var only),
# force 1-to-1, conditional indep
)
fastLink (cont.)
blockData() is mostly a split into lists
splink
Some reasons you might consider splink as an alternative:
fastLink: 300k obs. in 400 min (2.8 GHz Intel i7 (8 core), 8GB RAM) in ~2017
splink: 7M obs. (1B comparisons) in 20 min (c6g.2xlarge, 8 vCPU, 16GB RAM) in ~2024
splink model
Recommendation (see here):
predict() stage, where comparisons are cheap and parallelization scales ~linearly
splink
\[\mathbb{P}(\text{match} \mid \text{observation}) = 1 - \left[1 + \left(\frac{\lambda}{1 - \lambda}\right) \prod_{k} \frac{m_k}{u_k}\right]^{-1} \]
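This expression is algebraically the Bayes posterior. A quick numerical check, with made-up \(m_k, u_k\) values:

```python
# Check that the bracketed splink-style expression equals the direct
# Bayes posterior: lam*prod(m) / (lam*prod(m) + (1-lam)*prod(u)).
from math import prod

lam = 0.01
m = [0.95, 0.90, 0.80]   # P(field agrees | match), illustrative
u = [0.05, 0.20, 0.30]   # P(field agrees | non-match), illustrative

bayes_factor = (lam / (1 - lam)) * prod(mk / uk for mk, uk in zip(m, u))
bracket_form = 1 - (1 + bayes_factor) ** -1
direct = lam * prod(m) / (lam * prod(m) + (1 - lam) * prod(u))
```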
Expectation step: Use the estimated model parameters to predict which record comparisons are matches
Maximisation step: Use the predictions to re-estimate the model parameters
See Linacre (here) for source, more info.
splink
Class-based (newer):
Dict-based (still supported):
Blend (most likely):
place_comp = CustomComparison(
output_column_name = "place",
comparison_description = "Combo place (ZIP, city)",
comparison_levels = [
cll.CustomLevel(
"(a_zip_l IS NULL OR a_zip_r IS NULL) AND (a_city_l IS NULL OR a_city_r IS NULL)"
).configure(is_null_level=True),
cll.ExactMatchLevel("a_zip", label_for_charts="Exact (zip)"
).configure(term_frequency_adjustments=True),
cll.ExactMatchLevel("a_city", label_for_charts="Exact (city)"
).configure(term_frequency_adjustments=True),
cll.JaroLevel("a_city", distance_threshold=0.9),
cll.ElseLevel()
])
Key ideas:
term_frequency_adjustments downweights common values (e.g., “NEW YORK” vs. rare ZIP codes)
splink performance
Blocking rules to generate predictions take a salting_partitions argument.
If you have 3 blocking rules, one with salting_partitions = 2 (and the others with 1): \(1 + 1 + 2 = 4\) partitions
Then: base parallelism (9) \(\times\) 4 = ~36 parallel tasks
Key point: the size of the blocks matters for parallelism and memory. Salting splits large blocks into smaller partitions to improve throughput.
This is a miserable problem, usually compounded by a lack of blocking variables.
“J.P. Morgan Chase” vs. “JPMORGON” vs. “JPM”
"JP MORGAN" and "CHASE BANK" are the same company now, but were not before 2000.
“FACEBOOK” vs. “META PLATFORMS, INC.”
There is a string similarity problem AND an information problem.
If we could create an \(n\)-dimensional representation of the string and incorporate outside information…
Libgober and Jerzak (2024, PSRM): LinkOrgs software and paper
Kaufman and Klevs (2022, Political Analysis): Adaptive Fuzzy String Matching
fuzzylink: “Active learning” via LLM
Ornstein (2025, Political Analysis) | fuzzylink on CRAN
Idea: Capture both semantic and lexical similarity. A “supervised” model, but use LLMs to embed and label data.
fuzzylink example
fuzzylink::fuzzylink(
dfA = ceda,
dfB = l2,
by = 'fullname',
blocking.variables = c('last', 'place', 'cntyname'),
record_type = 'person',
instructions = 'The first name comes from a list of candidates for public office in
California. The second name comes from a voter registration file. Bear in mind that
some candidates may run for office under a nickname or middle name, and some records
may contain spelling errors.',
model = 'gpt-5.2',
fmla = match ~ sim + jw,
learner = 'glm'
)
EnsembleLink: Zero-shot record linkage
Two-stage pipeline, with no training data and no API key required (runs locally):
ensemble_link_blocked())
ensemblelink examples
# devtools::install_github('snoahdasanaike/ensemblelink/r_package')
library(ensemblelink)
results <- ensemble_link(
queries = c("John Smith", "Jane Doe"),
corpus = c("J. Smith", "Jane M. Doe", "Bob Wilson"),
embedding_model = "Qwen/Qwen3-Embedding-0.6B",
reranker_model = "jinaai/jina-reranker-v2-base-multilingual",
top_k = 30
)
Based on Leskovec, Rajaraman, and Ullman, Mining of Massive Datasets, Ch. 3
We want: similar strings → similar hashes (buckets), in ~\(O(N)\) time.
Recall that
\[J(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}\]
The pipeline:
This gives us a way to block without exact matching — we can find candidate pairs that are approximately similar, without comparing every pair.
“k-shingles” are n-grams of characters (or words). Here, character-level:
JOHN as 2-grams: {JO, OH, HN}
Now represent every string by its k-shingle set:
\(\text{Jaccard}(\texttt{JOHN}, \texttt{JOHNN}) = \frac{|\{\texttt{JO, OH, HN}\}|}{|\{\texttt{JO, OH, HN, NN}\}|} = \frac{3}{4}\)
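The shingling and Jaccard computation above fit in a few lines:

```python
# Character k-shingles and Jaccard similarity over shingle sets.
def shingles(s: str, k: int = 2) -> set:
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard(a: str, b: str, k: int = 2) -> float:
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)
```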
{JO, OH, HN} → {2, 1, 7}
One-hot encode each string by its character shingles:
| JOHN | JOHNN | |
|---|---|---|
| J | 1 | 1 |
| O | 1 | 1 |
| H | 1 | 1 |
| N | 1 | 1 |
| E | 0 | 0 |
With many unique shingles, this matrix is very sparse and requires a lot of memory to store.
Still \(\frac{N(N-1)}{2}\) pairwise comparisons. Can’t do this on one-hot encoded vectors at scale.
We want to find similar columns while computing small signatures.
Idea: “hash” each column (i.e., each string’s one-hot encoding) so that the hashes all fit in memory.
Key property: for Jaccard similarity, we can find a hash \(h\) such that
\[J(C_1, C_2) = \Pr(h(C_1) = h(C_2))\]
This hash function is called the min-hash.
Randomly permute the rows of the one-hot matrix. For each column (string), the min-hash is the index of the first (“minimum”) row with a 1.
Initial shingles-by-docs one-hot matrix:
| \(S_1\) | \(S_2\) | \(S_3\) | \(S_4\) | |
|---|---|---|---|---|
| a | 1 | 0 | 0 | 1 |
| b | 0 | 0 | 1 | 0 |
| c | 0 | 1 | 0 | 1 |
| d | 1 | 0 | 1 | 1 |
| e | 0 | 0 | 1 | 0 |
Three random permutations of the rows:
| \(P_1\) | \(P_2\) | \(P_3\) | |
|---|---|---|---|
| a | 4 | 5 | 1 |
| b | 5 | 1 | 2 |
| c | 2 | 4 | 5 |
| d | 1 | 3 | 4 |
| e | 3 | 2 | 3 |
Yields a signature matrix:
| \(S_1\) | \(S_2\) | \(S_3\) | \(S_4\) | |
|---|---|---|---|---|
| \(P_1\) | 1 | 2 | 1 | 1 |
| \(P_2\) | 3 | 4 | 1 | 3 |
| \(P_3\) | 1 | 5 | 2 | 1 |
For a random permutation \(P_i\), \(\Pr(P_i(S_1) = P_i(S_2)) = J(S_1, S_2)\)
Why? For any random row, the pair \((S_1, S_2)\) is one of:
| Row values | Outcome |
|---|---|
| \((1, 1)\) | count as \(X\) |
| \((0, 1)\) or \((1, 0)\) | count as \(Y\) |
| \((0, 0)\) | skip, move to next row |
\[\Pr(P_i(S_1) = P_i(S_2)) = \frac{X}{X + Y}\]
which is exactly the Jaccard similarity.
So: similarity of \(S_1\) and \(S_2\) \(\approx\) fraction of hash functions in which their signatures agree.
As \(N_{\text{hashes}} \to \infty\), this converges to Jaccard.
Now we have a much denser representation of our matrix. (FYI, 100 permutations is common.)
But permutations and storage are both quite memory-intensive.
Tweak 1: Instead of \(N\) actual permutations, use \(N\) hash functions that map row numbers to buckets.
Universal hashing: \(h(x) = ((ax + b) \bmod p) \bmod N\)
where \(x\) is the row number, \(a, b\) are random integers, and \(p\) is a prime \(> N\).
Example: \(h_1 = (x + 1) \bmod 5\), \(\quad h_2 = (3x + 1) \bmod 5\)
As long as the number of buckets is large enough to avoid too many collisions, this approximates random row permutations without actually permuting.
We can work string by string without storing the full one-hot matrix.
For each string \(S_i\): for each row where \(S_i = 1\), update \(\text{sig}(h, S_i) \gets \min(\text{current}, h(\text{row}))\).
Signature matrix:
Initialize all entries to ∞.
Row 1 (S₁, S₄ = 1 where h₁=1, h₂=1): update S₁ and S₄ to (1, 1).
Row 2 (S₃ = 1 where h₁=2, h₂=4): update S₃ to (2, 4).
Row 3 (S₂, S₄ = 1 where h₁=3, h₂=2): update S₂ to (3, 2). S₄ unchanged (3>1, 2>1).
Row 4 (S₁, S₃, S₄ = 1 where h₁=4, h₂=0): ...
Row 5 (S₃ = 1 where h₁=0, h₂=3): ...
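The row-by-row update, written out in full for this example (same \(h_1, h_2\), rows indexed \(x = 0..4\); a sketch of the worked example, not a general implementation):

```python
# One-pass min-hash: scan rows once, updating each string's signature
# with h(row) wherever the one-hot entry is 1.
INF = float("inf")

# shingle-by-string one-hot matrix (rows a..e, strings S1..S4)
rows = [
    [1, 0, 0, 1],  # a
    [0, 0, 1, 0],  # b
    [0, 1, 0, 1],  # c
    [1, 0, 1, 1],  # d
    [0, 0, 1, 0],  # e
]
hashes = [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5]

sig = [[INF] * 4 for _ in hashes]   # hashes-by-strings signature matrix
for x, row in enumerate(rows):
    for i, h in enumerate(hashes):
        hx = h(x)
        for s in range(4):
            if row[s]:
                sig[i][s] = min(sig[i][s], hx)
```

After the scan, S₁ and S₄ (the most similar pair) end up with identical signatures on both hash functions.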
Problem: Still scanning \(N_{\text{shingles}}\) rows per hash function.
Tweak 2: Only examine \(M\) rows per hash function, where \(M \ll N_{\text{shingles}}\). This gives a \(\sim N_{\text{shingles}}/M\) reduction in compute.
This is an approximation. For two random strings \(S_i, S_j\):
As long as case (1) is rare, this isn’t a serious problem — and we can spend the saved time computing more hash functions.
We have an \(N_{\text{hashes}} \times N_{\text{obs}}\) signature matrix. Say \(250 \times 1\text{M}\): that’s ~1 GB.
But still \(\binom{1\text{M}}{2} \approx 500\text{B}\) comparisons!
We could hash each signature to some value (and then block on that), but we’d only catch items with exactly the same signatures.
We need a way to find approximately matching signatures…
Divide the signature matrix into \(b\) bands of \(r\) rows, where \(b \cdot r = N_{\text{hashes}}\).
“Candidate” pairs: strings that have the same signature in at least one band.
How to find them? Hash each band’s signature to a bucket. Look at buckets with more than one item.
Example: 100,000 strings, 100 signatures → ~40 MB
With \(b = 20, r = 5\), find pairs with Jaccard \(\geq 0.8\):
\[\Pr(C_1, C_2 \text{ match in one band}) = (0.8)^5 \approx 0.328\]
\[\Pr(\text{not similar in ANY band}) = (1 - (0.8)^5)^{20} \approx 0.00035\]
We miss ~1 in 3,000 true pairs.
The probability of becoming a candidate pair at Jaccard similarity \(s\) is:
\[1 - (1 - s^r)^b\]
This is an S-curve.
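Plugging in the banding example's numbers (\(b = 20, r = 5\)):

```python
# Probability that a pair becomes a candidate under LSH banding:
#   P(candidate | Jaccard similarity s) = 1 - (1 - s**r)**b
def candidate_prob(s: float, r: int = 5, b: int = 20) -> float:
    return 1 - (1 - s ** r) ** b
```

At s = 0.8, the miss rate 1 - candidate_prob(0.8) is about 3.5e-4, i.e., roughly 1 in 3,000 true pairs, while dissimilar pairs are almost never compared.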
Didn’t cover: Euclidean LSH for vectors
zoomerjoin in R
Great package for visualizing and implementing LSH.
A (0.1, 0.7, 0.001, 0.99)-sensitive LSH means that strings with similarity less than 0.1 have at most a 0.1% chance of being compared, while strings with at least 0.7 similarity have at least a 99% chance of being compared.
band_width n_bands
5 26
zoomerjoin (1/2)
  cluster        n
1 LINDA 16451074
2 SANDRA 8000394
3 KAREN 7973116
4 PAMELA 7548071
5 HELEN 5752257
6 MARY 5179872
zoomerjoin (2/2)
   first gender      N cluster
1 LINDA F 876697 LINDA
2 DIANE F 368332 LINDA
3 JANE F 154558 LINDA
4 DIANA F 271258 LINDA
5 ANNE F 168358 LINDA
6 DIANNE F 58080 LINDA
7 ANNA F 324602 LINDA
8 ANNIE F 62001 LINDA
9 JANIS F 27071 LINDA
10 LILLIAN F 78254 LINDA
lsh, the DuckDB version
Syntax: lsh_min(string, ngram_width, band_count, band_size, seed) (docs)
INSTALL lsh FROM community;
LOAD lsh;
CREATE OR REPLACE TEMPORARY TABLE temp_names (
name_a VARCHAR,
name_b VARCHAR
);
INSERT INTO temp_names (name_a, name_b) VALUES
('Charlotte Brown', 'Charlene Browning'),
(NULL, 'Davis Martin'),
('Olivia Thomas', 'Olive Thomason'),
('Alice Johnson', NULL);
SELECT lsh_min(name_a, 2, 3, 2, 123) AS hash FROM temp_names;
lsh
Storing these vectors is very memory-intensive. Instead,
SQL is sometimes the “lower level” language.
splink blocking rules
Putting it all together:
The record linkage/entity resolution literature is enormous. There are many supervised and unsupervised methods, some of which merge some of the steps in our framework.